Over the past decade, social media has produced a variety of outlets for creating and sharing music. One site, Soundcloud, has been especially important to the recent surge of interest in Electronic Dance Music (EDM). Soundcloud has made it possible for anyone in the world to share their music without the rigid requirements and licenses that applications like Apple Music and Spotify demand. However, because everyone now has a chance at finding success in the music industry, it has become increasingly difficult to stand out. Twelve hours of music are uploaded every minute. With such immense competition, one crucial question arises: how can you use the Soundcloud interface to engage users of the application and create the next most popular electronic dance song?
What initially inspired me to research this topic was a friend from high school who unexpectedly took an interest in becoming a DJ and creating music. He decided that Soundcloud was his best means of becoming “known” in the community. This desire has become common among teenagers across the United States, as applications like Soundcloud have made producing and sharing music far more accessible. I became curious about which aspects of Soundcloud people could take advantage of to create the next great EDM song.
I created a sample of 75 electronic dance songs by searching within my own “likes” on Soundcloud. I tried to diversify the entries across a variety of factors while ensuring that they all fell within EDM, which I did by filtering the songs by the keyword “EDM.” Using the information available on the Soundcloud website for each track, I entered the song data manually in Excel to create my final dataset. I avoided missing-data issues by choosing only tracks with all the information I needed.
Since my variables are not entirely self-explanatory, here is a description for each one:
Name: The name of the song as it is presented on Soundcloud
Artist: The name of the artist of the song as it is presented on Soundcloud
Likes: The number of likes a song has
Com: The number of comments posted on the song
Rep: The number of reposts of the song. “Reposting” is publishing the track onto your own profile without actually posting the track yourself while still giving all credit to the original artist.
engage: A variable I created to assess the success of a particular song. It is the total number of plays divided by a weighted total of likes, comments, and reposts
LiCount: The total number of times the song has been played
FreeD: The availability to download the song for free. “Yes” if it could be downloaded for free, “No” if not
Mon: The month the song was published
Year: The year the song was published
FoCount: The number of followers the artist who posted the song has
NumTags: The number of hashtags utilized
Len: The length of the song in seconds
NumTracks: The number of total tracks an artist has posted
ProUser: If the artist has purchased the “Soundcloud Pro” plan. “Yes” if the artist was a user, “No” if not
NumFollow: The number of users an artist is following
Remix: If the song is a remix or radio-edit of another song. “Yes” if it was a remix, “No” if not
Loc: The state in which the song was published. “O” if outside the country
sincerelease: The amount of time, in months, between the song's publication and May 2017
library(lattice)
songs = read.csv("SoundcloudData.csv")
attach(songs)
###Center variables
songs$lencent = songs$Len - mean(songs$Len)
###Create dummy variables for binary variables
n = 75
# Indicator for whether the song is a remix
songs$RemixY = rep(0, n)
songs$RemixY[songs$Remix == "Yes"] = 1
# Indicator for whether the user has a pro subscription
songs$pro = rep(0, n)
songs$pro[songs$ProUser == "Yes"] = 1
# Indicator for whether the song is a free download or not
songs$fdY = rep(0, n)
songs$fdY[songs$FreeD == "Yes"] = 1
###Create Month and Year variables in order to create later sincerelease variable
songs$Month = rep(0, n)
songs$Month[songs$Mon == "January"] = 4
songs$Month[songs$Mon == "February"] = 3
songs$Month[songs$Mon == "March"] = 2
songs$Month[songs$Mon == "April"] = 1
songs$Month[songs$Mon == "May"] = 0
songs$Month[songs$Mon == "June"] = 1
songs$Month[songs$Mon == "July"] = 2
songs$Month[songs$Mon == "August"] = 3
songs$Month[songs$Mon == "September"] = 4
songs$Month[songs$Mon == "October"] = 5
songs$Month[songs$Mon == "November"] = 6
songs$Month[songs$Mon == "December"] = 7
songs$Y = rep(0, n)
songs$Y[songs$Year == "2014"] = 3
songs$Y[songs$Year == "2015"] = 2
songs$Y[songs$Year == "2016"] = 1
songs$Y[songs$Year == "2017"] = 0
###Create sincerelease variable
songs$sincerelease = abs(5 - songs$Month + 12*songs$Y)
hist(songs$sincerelease)
###Create dummy variables for seasons
n = 75
songs$winter = rep(0, n)
songs$winter[songs$Mon == "December"] = 1
songs$winter[songs$Mon == "January"] = 1
songs$winter[songs$Mon == "February"] = 1
songs$spring = rep(0, n)
songs$spring[songs$Mon == "March"] = 1
songs$spring[songs$Mon == "April"] = 1
songs$spring[songs$Mon == "May"] = 1
songs$summer = rep(0, n)
songs$summer[songs$Mon == "June"] = 1
songs$summer[songs$Mon == "July"] = 1
songs$summer[songs$Mon == "August"] = 1
songs$fall = rep(0, n)
songs$fall[songs$Mon == "September"] = 1
songs$fall[songs$Mon == "October"] = 1
songs$fall[songs$Mon == "November"] = 1
Here I created another variable that I thought might be a useful predictor of engagement, called “sincerelease,” which is the time, in months, between the song’s publication and May 2017. I also created dummy variables for each season, in the hope that these time periods have some power in predicting engagement, and dummy variables for my “Yes or No” predictors.
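As a rough sketch of the definition in words, the number of months between publication and May 2017 can also be counted directly from the month name and year. The data frame `demo` and the column `months_since` are hypothetical names used only for illustration, and the sketch assumes `Mon` holds full English month names. Under this definition a track from March 2014 is 38 months old and one from May 2016 is 12 months old; these counts need not match the values produced by the recoding above.

```r
# Illustrative rows only; the column names mirror Mon and Year from the dataset.
demo <- data.frame(
  Mon  = c("March", "May", "May"),
  Year = c(2014, 2016, 2017),
  stringsAsFactors = FALSE
)

# month.name is a built-in constant: "January", ..., "December".
month_num <- match(demo$Mon, month.name)

# Whole months between the publication month and May 2017 (month 5 of 2017).
demo$months_since <- (2017 - demo$Year) * 12 + (5 - month_num)

# A season indicator can be built the same way in a single step.
demo$spring <- as.integer(demo$Mon %in% c("March", "April", "May"))

demo
```

Counting from the calendar month in this way keeps the direction of time explicit, so months before and after May are not treated as equally distant.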
###Create engagement variable
denom = songs$Likes + songs$Com*3 + songs$Rep*2
songs$denom = denom
songs$engage = songs$LiCount/songs$denom
###Check for necessary transformations
hist(songs$engage)
#Check to see how many songs are published inside compared to outside the United States
songs$country = rep(0, n)
songs$country[songs$Loc == "O"] = 1
###Summary of numerical variables
FCsum <- data.frame(FollowerCount = c(mean = mean(FoCount),sd = sd(FoCount)))
NTsum <- data.frame(NumberOfTags = c(mean = mean(NumTags),sd = sd(NumTags)))
LNsum <- data.frame(SongLength = c(mean = mean(Len),sd = sd(Len)))
NTsum <- data.frame(NumberOfTracks = c(mean = mean(NumTracks),sd = sd(NumTracks)))
NFsum <- data.frame(NumberOfFollowing= c(mean = mean(NumFollow),sd = sd(NumFollow)))
Esum <- data.frame(Engagement= c(mean = mean(songs$engage),sd = sd(songs$engage)))
cbind(FCsum, NTsum, LNsum, NTsum, NFsum, Esum)
## FollowerCount NumberOfTracks SongLength NumberOfTracks
## mean 260822.4 73.62667 216.42667 73.62667
## sd 835845.1 175.51698 37.91177 175.51698
## NumberOfFollowing Engagement
## mean 284.9733 23.496744
## sd 395.7318 9.799274
###Summary of categorical variables
table(songs$FreeD)/75
##
## No Yes
## 0.5066667 0.4933333
table(songs$winter)/75
##
## 0 1
## 0.7866667 0.2133333
table(songs$spring)/75
##
## 0 1
## 0.64 0.36
table(songs$summer)/75
##
## 0 1
## 0.7733333 0.2266667
table(songs$fall)/75
##
## 0 1
## 0.8 0.2
table(songs$ProUser)/75
##
## No Yes
## 0.1333333 0.8666667
table(songs$Remix)/75
##
## No Yes
## 0.4933333 0.5066667
table(songs$country)/75
##
## 0 1
## 0.6266667 0.3733333
boxplot(songs$engage~songs$FreeD, data = songs, ylab = "Engagement", xlab = "Availability to Download")
boxplot(songs$engage~songs$Mon, data = songs, ylab = "Engagement", xlab = "Month of Publication")
boxplot(songs$engage~songs$Year, data = songs, ylab = "Engagement", xlab = "Year of Publication")
boxplot(songs$engage~songs$ProUser, data = songs, ylab = "Engagement", xlab = "Artists with Soundcloud Pro Plan")
boxplot(songs$engage~songs$country, data = songs, ylab = "Engagement", xlab = "Country of Publication")
boxplot(songs$engage~songs$Remix, data = songs, ylab = "Engagement", xlab = "Remix/Radio-Edit")
plot(y = songs$engage, x = songs$NumTags)
plot(y = songs$engage, x = songs$Len)
###Transformations
plot(y = songs$engage, x = songs$FoCount, xlab = "Follower Count", ylab = "Engagement", main = "Follower Count vs. Engagement")
#data clustered near 0. Need to log transform
songs$logfc = log(songs$FoCount)
plot(y = songs$engage, x = songs$logfc, xlab = "Log of Follower Count", ylab = "Engagement", main = "Log of Follower Count vs. Engagement")
plot(y = songs$engage, x = songs$NumTracks, xlab = "Number of Tracks", ylab = "Engagement", main = "Number of Tracks vs. Engagement")
#data again clustered near 0. Need to log transform
songs$lognt = log(songs$NumTracks)
plot(y = songs$engage, x = songs$lognt, xlab = "Log of Number of Tracks", ylab = "Engagement", main = "Log Number of Tracks vs. Engagement")
plot(y = songs$engage, x = songs$NumFollow, xlab = "Number of Following", ylab = "Engagement", main = "Number of Following vs. Engagement")
#somewhat linear, slightly packed around 0. try a log transformation
songs$lognf = log(songs$NumFollow + 0.01)
plot(y = songs$engage, x = songs$lognf, xlab = "Log of Number of Following", ylab = "Engagement", main = "Log of Number of Following vs. Engagement")
#Interactions and Collinearity
plot(y = songs$logfc, x = songs$lognt, xlab = "Log of Number of Tracks", ylab = "Log of Follower Count", main = "Log of Follower Count vs. Log of Number of Tracks")
#Check correlation. Plot seems like the two are somewhat associated
cor(songs$logfc, songs$lognt)
## [1] 0.5597896
#Correlation is somewhat high. Try creating a model with 1. nt and nf and 2. with nf and fc and compare them. Pick the better one as the starting model.
plot(y = songs$logfc, x = songs$NumTags)
plot(y = songs$lencent, x = songs$NumTags)
plot(y = songs$logfc, x = songs$lognf)
plot(y = songs$logfc, x = songs$Remix)
plot(x = songs$ProUser, y = songs$lognf)
plot(y = songs$ProUser, x = songs$FreeD)
plot(x = songs$Loc, y = songs$lognt)
plot(x = songs$Loc, y = songs$logfc)
plot(x = songs$Loc, y = songs$lognf)
xyplot(songs$engage~songs$logfc | songs$country, data = songs)
xyplot(songs$engage~songs$lognt | songs$country, data = songs)
xyplot(songs$engage~songs$lognf | songs$country, data = songs)
xyplot(songs$engage~songs$logfc | songs$FreeD, data = songs)
xyplot(songs$engage~songs$lognt | songs$FreeD, data = songs)
xyplot(songs$engage~songs$lognf | songs$FreeD, data = songs)
xyplot(songs$engage~songs$logfc | songs$sincerelease, data = songs)
xyplot(songs$engage~songs$lognt | songs$sincerelease, data = songs)
xyplot(songs$engage~songs$lognf | songs$sincerelease, data = songs)
The main reason for collecting the number of likes, comments, reposts, and plays was to create the variable “engage,” which gauges the success of a particular song. A Soundcloud user adds to the play count just by clicking the play button on a song, or by having the song come on next on shuffle, even if he or she immediately clicks away. Therefore, to tell whether a song is truly doing well, it’s crucial to check the “engagement” of the track with its audience. When a user comments on, reposts, or likes a song, we can tell that he or she has shown particular interest in the track and hasn’t simply clicked the play button. However, the level of engagement varies by action. Liking displays the least engagement, as it basically just tells yourself that you are a fan of the track. Reposting displays higher engagement because you enjoy the track enough that you want your followers to listen to it as well. Commenting takes the most engagement because you are taking the time to tell the artist how much you liked or disliked the track; it also takes the most time to complete. For this reason, I gave each action a different multiplier in the engagement equation: 1 for likes, 2 for reposts, and 3 for comments. By dividing the total number of plays by this weighted sum, I hoped to create a standardized variable for comparing tracks to each other based on engagement with users throughout the application.
With this gauge, the lower the “engagement” value, the better. The closer the value is to 1 (the total number of plays equals the denominator), the larger the proportion of listeners who engage with the track. The difference between small values, like 1 and 2, is extremely large: engagement per play drops by 50%. As the values become larger, the difference between two adjacent values becomes far less important. The gap between 1000/1 and 1000/2 is extreme, while the gap between 1000/15 and 1000/16 is not nearly as large.
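To make the scale concrete, here is a small worked example. It uses the counts for the track “Bad” as they are printed in the leverage check later in this report; the variable names (`plays`, `weighted`, `per_1000`) are mine and purely illustrative.

```r
# Counts for one track, copied from the row printed for "Bad" in this report.
plays    <- 29400000
likes    <- 441000
comments <- 9094
reposts  <- 97700

# Weighted interactions: likes count once, reposts twice, comments three times.
weighted <- likes + 3 * comments + 2 * reposts
engage   <- plays / weighted
engage

# Weighted interactions per 1000 plays implied by a few engage values.
vals     <- c(1, 2, 15, 16)
per_1000 <- 1000 / vals
data.frame(engage = vals, per_1000_plays = round(per_1000, 1))
```

Going from 1 to 2 halves the interactions per 1000 plays (1000 to 500), while going from 15 to 16 changes them only from about 66.7 to 62.5.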
The summary statistics (mean and standard deviation, or percentage) of each variable are presented above. Using the correlation function, I checked whether any of the predictor variables were highly associated with one another. To no particular surprise, apart from the moderate correlation between log follower count and log number of tracks (0.56), I found nothing concerning.
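In the summary code above, `NTsum` is assigned twice, so the tag summary is overwritten and the track summary is printed twice. One way to avoid reusing object names is to compute all the means and standard deviations in a single call. The sketch below uses made-up values in a hypothetical data frame `demo`, not the project data.

```r
# Illustrative values only, not the project data.
demo <- data.frame(
  FoCount   = c(1200, 56000, 830, 492000, 90900),
  NumTags   = c(3, 5, 8, 5, 14),
  NumTracks = c(12, 40, 7, 139, 230)
)

# One row per statistic, one column per variable; no intermediate objects to overwrite.
summary_table <- sapply(demo, function(x) c(mean = mean(x), sd = sd(x)))
summary_table

# Pairwise correlations among the same columns.
round(cor(demo), 2)
```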
However, when checking the plots of a few continuous variables, I discovered the need for transformation. For Follower Count, Number of Tracks, and Number Following, the data points were closely packed towards the left side of the graph, indicating the need for a logarithmic transformation. After transformation, Follower Count and Number of Tracks both show a far more linear, though not strong, relationship with engagement, while Number Following is only somewhat more linear. I decided to keep each of these variables in transformed form for my regression models.
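Because Number Following contains zeros, the log transformation needs an offset, and the choice of offset determines where those zeros land. The sketch below compares the `0.01` offset used in this analysis with `log1p()`, which is shown only as an alternative and was not used here; the counts are illustrative.

```r
# Illustrative counts, including a zero like the one in NumFollow.
following <- c(0, 10, 284, 2500)

# Offset used in this analysis: zeros map to log(0.01), far below the other values.
log_small_offset <- log(following + 0.01)

# log1p(x) computes log(1 + x), so zeros map to 0.
log_plus_one <- log1p(following)

round(cbind(following, log_small_offset, log_plus_one), 3)
```

With the small offset a zero becomes about -4.605, which can make a single artist who follows nobody look like an extreme point on the transformed scale.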
###Model 1: Full
songsreg = lm(engage ~ sincerelease + fdY + summer + fall + winter + NumTags + lencent + pro + RemixY + lognf + logfc + country, data = songs)
summary(songsreg)
##
## Call:
## lm(formula = engage ~ sincerelease + fdY + summer + fall + winter +
## NumTags + lencent + pro + RemixY + lognf + logfc + country,
## data = songs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.7408 -4.8601 0.0417 4.4309 21.0930
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.19168 7.55940 1.613 0.11187
## sincerelease 0.48207 0.16739 2.880 0.00545 **
## fdY -3.17465 2.12247 -1.496 0.13980
## summer -2.26219 3.01191 -0.751 0.45545
## fall 4.66381 2.81098 1.659 0.10214
## winter 4.38417 2.89070 1.517 0.13444
## NumTags 0.33582 0.19815 1.695 0.09513 .
## lencent 0.04827 0.02852 1.692 0.09557 .
## pro -0.79413 3.66740 -0.217 0.82928
## RemixY 2.63560 2.11010 1.249 0.21635
## lognf 0.41059 0.48674 0.844 0.40216
## logfc 0.09376 0.63550 0.148 0.88318
## country 1.65059 2.16243 0.763 0.44818
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.458 on 62 degrees of freedom
## Multiple R-squared: 0.3759, Adjusted R-squared: 0.2551
## F-statistic: 3.112 on 12 and 62 DF, p-value: 0.001681
songscheck = lm(engage ~ sincerelease + fdY + summer + fall + winter + NumTags + lencent + pro + RemixY + lognf + lognt + country, data = songs)
summary(songscheck)
##
## Call:
## lm(formula = engage ~ sincerelease + fdY + summer + fall + winter +
## NumTags + lencent + pro + RemixY + lognf + lognt + country,
## data = songs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4152 -4.4685 -0.0326 4.5627 21.4728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.23303 5.25528 2.899 0.00518 **
## sincerelease 0.50743 0.16479 3.079 0.00309 **
## fdY -3.64177 2.13884 -1.703 0.09364 .
## summer -1.98166 2.99478 -0.662 0.51061
## fall 5.02652 2.83251 1.775 0.08088 .
## winter 4.77833 2.91915 1.637 0.10672
## NumTags 0.32767 0.19306 1.697 0.09467 .
## lencent 0.04465 0.02833 1.576 0.12010
## pro -0.13823 3.20016 -0.043 0.96569
## RemixY 2.41221 2.07068 1.165 0.24851
## lognf 0.30937 0.47239 0.655 0.51496
## lognt -0.69402 0.86397 -0.803 0.42487
## country 1.75445 2.15292 0.815 0.41824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.415 on 62 degrees of freedom
## Multiple R-squared: 0.3821, Adjusted R-squared: 0.2625
## F-statistic: 3.195 on 12 and 62 DF, p-value: 0.001328
anova(songsreg, songscheck)
## Analysis of Variance Table
##
## Model 1: engage ~ sincerelease + fdY + summer + fall + winter + NumTags +
## lencent + pro + RemixY + lognf + logfc + country
## Model 2: engage ~ sincerelease + fdY + summer + fall + winter + NumTags +
## lencent + pro + RemixY + lognf + lognt + country
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 62 4435.0
## 2 62 4390.9 0 44.142
songs1 = songscheck
coef(songs1)
## (Intercept) sincerelease fdY summer fall
## 15.23303337 0.50742998 -3.64177343 -1.98166429 5.02652417
## winter NumTags lencent pro RemixY
## 4.77833136 0.32766501 0.04464825 -0.13822530 2.41221242
## lognf lognt country
## 0.30936550 -0.69402059 1.75445179
###Model 2: Dropped "pro"
songsreg2 = lm(engage ~ sincerelease + fdY + summer + fall + winter + NumTags + lencent + RemixY + lognf + lognt + country, data = songs)
summary(songsreg2)
##
## Call:
## lm(formula = engage ~ sincerelease + fdY + summer + fall + winter +
## NumTags + lencent + RemixY + lognf + lognt + country, data = songs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4011 -4.4261 -0.0479 4.5571 21.4603
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.12213 4.54886 3.324 0.00148 **
## sincerelease 0.50752 0.16347 3.105 0.00285 **
## fdY -3.65517 2.09939 -1.741 0.08655 .
## summer -1.94738 2.86474 -0.680 0.49914
## fall 5.02537 2.80986 1.788 0.07851 .
## winter 4.77619 2.89552 1.650 0.10402
## NumTags 0.32913 0.18853 1.746 0.08572 .
## lencent 0.04463 0.02810 1.588 0.11725
## RemixY 2.41883 2.04858 1.181 0.24215
## lognf 0.30775 0.46717 0.659 0.51245
## lognt -0.69963 0.84737 -0.826 0.41212
## country 1.76035 2.13149 0.826 0.41199
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.349 on 63 degrees of freedom
## Multiple R-squared: 0.3821, Adjusted R-squared: 0.2742
## F-statistic: 3.541 on 11 and 63 DF, p-value: 0.0006648
anova(songs1, songsreg2, test = "Chisq")
## Analysis of Variance Table
##
## Model 1: engage ~ sincerelease + fdY + summer + fall + winter + NumTags +
## lencent + pro + RemixY + lognf + lognt + country
## Model 2: engage ~ sincerelease + fdY + summer + fall + winter + NumTags +
## lencent + RemixY + lognf + lognt + country
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 62 4390.9
## 2 63 4391.0 -1 -0.13213 0.9655
###Model 3: Dropped "lognf"
songsreg3 = lm(engage ~ sincerelease + fdY + summer + fall + winter + NumTags + lencent + RemixY + lognt + country, data = songs)
summary(songsreg3)
##
## Call:
## lm(formula = engage ~ sincerelease + fdY + summer + fall + winter +
## NumTags + lencent + RemixY + lognt + country, data = songs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.7134 -4.3422 -0.2001 4.6952 21.9944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.60849 3.93233 4.224 7.76e-05 ***
## sincerelease 0.51419 0.16243 3.166 0.00237 **
## fdY -3.48215 2.07367 -1.679 0.09798 .
## summer -1.47487 2.76119 -0.534 0.59509
## fall 5.05845 2.79696 1.809 0.07522 .
## winter 5.12790 2.83326 1.810 0.07501 .
## NumTags 0.30805 0.18497 1.665 0.10071
## lencent 0.04359 0.02793 1.561 0.12357
## RemixY 2.58459 2.02406 1.277 0.20624
## lognt -0.81174 0.82642 -0.982 0.32968
## country 1.45970 2.07283 0.704 0.48386
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.312 on 64 degrees of freedom
## Multiple R-squared: 0.3778, Adjusted R-squared: 0.2806
## F-statistic: 3.886 on 10 and 64 DF, p-value: 0.000376
anova(songsreg3, songsreg2, test = "Chisq")
## Analysis of Variance Table
##
## Model 1: engage ~ sincerelease + fdY + summer + fall + winter + NumTags +
## lencent + RemixY + lognt + country
## Model 2: engage ~ sincerelease + fdY + summer + fall + winter + NumTags +
## lencent + RemixY + lognf + lognt + country
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 64 4421.2
## 2 63 4391.0 1 30.246 0.5101
###Model 5: Dropped "country"
songsreg4 = lm(engage ~ sincerelease + fdY + summer + fall + winter + NumTags + lencent + RemixY + lognt, data = songs)
summary(songsreg4)
##
## Call:
## lm(formula = engage ~ sincerelease + fdY + summer + fall + winter +
## NumTags + lencent + RemixY + lognt, data = songs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1834 -3.6466 0.0768 4.9550 22.7814
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.62544 3.91698 4.244 7.11e-05 ***
## sincerelease 0.52067 0.16154 3.223 0.00198 **
## fdY -3.53449 2.06429 -1.712 0.09163 .
## summer -1.32693 2.74249 -0.484 0.63013
## fall 5.18780 2.78007 1.866 0.06655 .
## winter 5.45979 2.78293 1.962 0.05406 .
## NumTags 0.32510 0.18266 1.780 0.07978 .
## lencent 0.04146 0.02766 1.499 0.13873
## RemixY 2.62976 2.01518 1.305 0.19650
## lognt -0.74728 0.81815 -0.913 0.36442
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.279 on 65 degrees of freedom
## Multiple R-squared: 0.373, Adjusted R-squared: 0.2862
## F-statistic: 4.296 on 9 and 65 DF, p-value: 0.0002081
anova(songsreg4, songsreg3, test = "Chisq")
## Analysis of Variance Table
##
## Model 1: engage ~ sincerelease + fdY + summer + fall + winter + NumTags +
## lencent + RemixY + lognt
## Model 2: engage ~ sincerelease + fdY + summer + fall + winter + NumTags +
## lencent + RemixY + lognt + country
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 65 4455.5
## 2 64 4421.2 1 34.258 0.4813
###Model 6: Dropped "lognt"
songsreg5 = lm(engage ~ sincerelease + fdY + summer + fall + winter + NumTags + lencent + RemixY, data = songs)
summary(songsreg5)
##
## Call:
## lm(formula = engage ~ sincerelease + fdY + summer + fall + winter +
## NumTags + lencent + RemixY, data = songs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5593 -4.3144 0.1037 4.7141 22.2875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.18557 2.86122 4.958 5.26e-06 ***
## sincerelease 0.49845 0.15950 3.125 0.00264 **
## fdY -3.04446 1.99084 -1.529 0.13099
## summer -1.37505 2.73854 -0.502 0.61726
## fall 4.75675 2.73628 1.738 0.08680 .
## winter 5.03538 2.74041 1.837 0.07065 .
## NumTags 0.32401 0.18243 1.776 0.08033 .
## lencent 0.04490 0.02737 1.641 0.10560
## RemixY 2.88202 1.99366 1.446 0.15302
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.269 on 66 degrees of freedom
## Multiple R-squared: 0.3649, Adjusted R-squared: 0.288
## F-statistic: 4.741 on 8 and 66 DF, p-value: 0.0001263
anova(songsreg5, songsreg4, test = "Chisq")
## Analysis of Variance Table
##
## Model 1: engage ~ sincerelease + fdY + summer + fall + winter + NumTags +
## lencent + RemixY
## Model 2: engage ~ sincerelease + fdY + summer + fall + winter + NumTags +
## lencent + RemixY + lognt
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 66 4512.7
## 2 65 4455.5 1 57.186 0.361
###Model 7: Dropped "fdY"
songsreg6 = lm(engage ~ sincerelease + summer + fall + winter + NumTags + lencent + RemixY, data = songs)
summary(songsreg6)
##
## Call:
## lm(formula = engage ~ sincerelease + summer + fall + winter +
## NumTags + lencent + RemixY, data = songs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.0797 -4.5270 -0.0263 5.6520 23.6395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.25665 2.59371 4.726 1.22e-05 ***
## sincerelease 0.50539 0.16102 3.139 0.00252 **
## summer -0.44271 2.69635 -0.164 0.87008
## fall 5.55963 2.71213 2.050 0.04429 *
## winter 5.80647 2.72040 2.134 0.03647 *
## NumTags 0.31683 0.18418 1.720 0.09001 .
## lencent 0.04560 0.02763 1.650 0.10362
## RemixY 2.60872 2.00537 1.301 0.19776
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.351 on 67 degrees of freedom
## Multiple R-squared: 0.3424, Adjusted R-squared: 0.2737
## F-statistic: 4.985 on 7 and 67 DF, p-value: 0.0001395
anova(songsreg5, songsreg6, test = "Chisq")
## Analysis of Variance Table
##
## Model 1: engage ~ sincerelease + fdY + summer + fall + winter + NumTags +
## lencent + RemixY
## Model 2: engage ~ sincerelease + summer + fall + winter + NumTags + lencent +
## RemixY
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 66 4512.7
## 2 67 4672.6 -1 -159.9 0.1262
###Final Model: Dropped "RemixY"
songsreg7 = lm(engage ~ sincerelease + summer + fall + winter + NumTags + Len, data = songs)
summary(songsreg7)
##
## Call:
## lm(formula = engage ~ sincerelease + summer + fall + winter +
## NumTags + Len, data = songs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.7796 -4.5013 -0.2586 4.7382 24.1776
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.36538 5.80998 0.407 0.68520
## sincerelease 0.51986 0.16145 3.220 0.00197 **
## summer -0.53625 2.70908 -0.198 0.84368
## fall 5.61127 2.72561 2.059 0.04335 *
## winter 6.46984 2.68574 2.409 0.01871 *
## NumTags 0.34125 0.18415 1.853 0.06821 .
## Len 0.04972 0.02759 1.802 0.07596 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.393 on 68 degrees of freedom
## Multiple R-squared: 0.3258, Adjusted R-squared: 0.2663
## F-statistic: 5.477 on 6 and 68 DF, p-value: 0.0001134
anova(songsreg7, songsreg6, test = "Chisq")
## Analysis of Variance Table
##
## Model 1: engage ~ sincerelease + summer + fall + winter + NumTags + Len
## Model 2: engage ~ sincerelease + summer + fall + winter + NumTags + lencent +
## RemixY
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 68 4790.6
## 2 67 4672.6 1 118.02 0.1933
# RemixY is retained, so the final model is songsreg6 rather than songsreg7
songsfinal = songsreg6
summary(songsfinal)
##
## Call:
## lm(formula = engage ~ sincerelease + summer + fall + winter +
## NumTags + lencent + RemixY, data = songs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.0797 -4.5270 -0.0263 5.6520 23.6395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.25665 2.59371 4.726 1.22e-05 ***
## sincerelease 0.50539 0.16102 3.139 0.00252 **
## summer -0.44271 2.69635 -0.164 0.87008
## fall 5.55963 2.71213 2.050 0.04429 *
## winter 5.80647 2.72040 2.134 0.03647 *
## NumTags 0.31683 0.18418 1.720 0.09001 .
## lencent 0.04560 0.02763 1.650 0.10362
## RemixY 2.60872 2.00537 1.301 0.19776
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.351 on 67 degrees of freedom
## Multiple R-squared: 0.3424, Adjusted R-squared: 0.2737
## F-statistic: 4.985 on 7 and 67 DF, p-value: 0.0001395
confint(songsfinal)
## 2.5 % 97.5 %
## (Intercept) 7.079590210 17.4337052
## sincerelease 0.184003248 0.8267791
## summer -5.824645226 4.9392345
## fall 0.146190452 10.9730760
## winter 0.376523986 11.2364072
## NumTags -0.050800720 0.6844511
## lencent -0.009561036 0.1007560
## RemixY -1.394018853 6.6114644
To start the modeling process, I included every explanatory variable as a predictor of engagement. Because of the fairly high correlation between log follower count and log number of tracks that I found in my EDA, I fit two separate starting models (one including nf and nt, the other including nf and fc) and chose the one with the higher R-squared value, since the predictors were otherwise the same (the anova between the two gave no test, because both models use the same number of parameters). I created a total of 8 different models before arriving at my final model. Checking the significance of individual predictors and using the anova command, I dropped the predictors for whether the artist was a Pro user, the log follower count, the log number following, the log number of tracks, the country of release, and whether the song was a free download. I ultimately kept whether the song was a remix as a predictor because it makes sense for the final interpretation, although it is not statistically significant. I used spring as the baseline for the season predictor.
With the predictors I am using, I didn’t expect any interactions between variables. Using the lattice library and xyplots, I checked for interactions involving the categorical variables and found nothing notable. All of the variables that may have had potential for interactions were eventually dropped anyway.
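The drop-one-predictor comparisons above are tests between nested models. The sketch below shows the same kind of comparison on `mtcars`, a dataset that ships with R and stands in for the song data here; the model names `full` and `reduced` are illustrative. It uses the default F test rather than the `"Chisq"` option used in this report.

```r
# mtcars ships with R and stands in for the song data.
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# Partial F-test: does dropping qsec significantly worsen the fit?
cmp <- anova(reduced, full)
cmp

# With a single dropped term, the F statistic equals the squared t statistic
# of that term in the full model.
t_qsec <- summary(full)$coefficients["qsec", "t value"]
c(F = cmp$F[2], t_squared = t_qsec^2)
```

This is why dropping one predictor at a time gives the same conclusion from the coefficient table as from the anova. When two models are not nested and have the same residual degrees of freedom, as with the two starting models here, anova reports the residual sums of squares but no test.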
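A visual check with xyplots can be complemented by fitting the interaction term directly. This is a sketch on `mtcars` rather than the song data, with `am` (0/1) standing in for a “Yes or No” predictor; the model names are illustrative and this test was not part of the original analysis.

```r
# mtcars stands in for the song data; am (0/1) plays the role of a yes/no predictor.
additive <- lm(mpg ~ wt + am, data = mtcars)
with_int <- lm(mpg ~ wt * am, data = mtcars)

# wt * am expands to wt + am + wt:am; the wt:am row is the interaction term.
summary(with_int)$coefficients

# Partial F-test for the interaction term.
anova(additive, with_int)
```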
leverage = hatvalues(songsfinal)
cooks = cooks.distance(songsfinal)
d2 = cbind(songs, leverage, cooks)
hist(leverage, main = "Leverage values for songs regression")
d2[d2$leverage > .3,]
## Name Artist Likes Com Rep LiCount FreeD Mon Year
## 18 Bad David Guetta & Showtek 441000 9094 97700 29400000 No March 2014
## FoCount NumTags Len NumTracks ProUser NumFollow Remix Loc lencent
## 18 492000 5 270 139 Yes 10 No O 53.57333
## RemixY pro fdY Month Y sincerelease winter spring summer fall denom
## 18 0 1 0 2 3 39 0 1 0 0 663682
## engage country logfc lognt lognf leverage cooks
## 18 44.29832 1 13.10623 4.934474 2.303585 0.3489056 0.1017359
hist(cooks)
d2[d2$cooks > .15,]
## Name Artist Likes Com Rep LiCount FreeD Mon
## 57 I Wanna Know Alesso ft. Nico & Vinz 15500 119 6574 424000 Yes May
## Year FoCount NumTags Len NumTracks ProUser NumFollow Remix Loc lencent
## 57 2016 90900 14 331 230 Yes 0 Yes O 114.5733
## RemixY pro fdY Month Y sincerelease winter spring summer fall denom
## 57 1 1 1 0 1 17 0 1 0 0 29005
## engage country logfc lognt lognf leverage cooks
## 57 14.61817 1 11.41752 5.438079 -4.60517 0.1898438 0.1774061
# Remove both flagged rows (18 and 57) in one step
songsdif = songs[-c(18, 57),]
songsdif$centeredl = songsdif$Len - mean(songsdif$Len)
songsfinaldif = lm(engage ~ sincerelease + summer + fall + winter + NumTags + Len, data = songsdif)
summary(songsfinaldif)
##
## Call:
## lm(formula = engage ~ sincerelease + summer + fall + winter +
## NumTags + Len, data = songsdif)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.288 -4.978 0.222 4.403 22.925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.77373 5.99989 -0.296 0.7684
## sincerelease 0.45119 0.18353 2.458 0.0166 *
## summer -1.13834 2.80630 -0.406 0.6863
## fall 5.29752 2.72538 1.944 0.0562 .
## winter 5.71477 2.63872 2.166 0.0339 *
## NumTags 0.41461 0.18136 2.286 0.0255 *
## Len 0.07265 0.02857 2.543 0.0133 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.137 on 66 degrees of freedom
## Multiple R-squared: 0.3372, Adjusted R-squared: 0.2769
## F-statistic: 5.595 on 6 and 66 DF, p-value: 9.699e-05
summary(songsfinal)
##
## Call:
## lm(formula = engage ~ sincerelease + summer + fall + winter +
## NumTags + lencent + RemixY, data = songs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.0797 -4.5270 -0.0263 5.6520 23.6395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.25665 2.59371 4.726 1.22e-05 ***
## sincerelease 0.50539 0.16102 3.139 0.00252 **
## summer -0.44271 2.69635 -0.164 0.87008
## fall 5.55963 2.71213 2.050 0.04429 *
## winter 5.80647 2.72040 2.134 0.03647 *
## NumTags 0.31683 0.18418 1.720 0.09001 .
## lencent 0.04560 0.02763 1.650 0.10362
## RemixY 2.60872 2.00537 1.301 0.19776
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.351 on 67 degrees of freedom
## Multiple R-squared: 0.3424, Adjusted R-squared: 0.2737
## F-statistic: 4.985 on 7 and 67 DF, p-value: 0.0001395
To finish my final model, I checked whether there were any high-leverage or influential points that I could remove to improve the predictive power of my regression model. Looking at the leverage and Cook’s distance values, I noticed that two points stood out: “Bad” had the highest leverage and “I Wanna Know” had the highest Cook’s distance. I removed these points and re-ran the regression. Doing so did not improve the fit: the refit model (which also omits the remix indicator) has a multiple R-squared of 0.3372, compared with 0.3424 for the final model, and the standard errors of the predictors changed little.
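The check described above can be sketched end to end as follows. The example uses `mtcars` in place of the song data, and the cutoffs `2 * p / n` and `4 / n` are common rules of thumb assumed here for illustration; they are not the 0.3 and 0.15 thresholds used in this report.

```r
# mtcars stands in for the song data.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

lev  <- hatvalues(fit)        # leverage of each observation
cook <- cooks.distance(fit)   # influence of each observation on the fit

p <- length(coef(fit))        # number of estimated coefficients
n <- nrow(mtcars)

# Rule-of-thumb cutoffs (assumptions for this sketch).
flagged <- which(lev > 2 * p / n | cook > 4 / n)

# Drop all flagged rows in one step; if none are flagged, keep the full data,
# because indexing with -integer(0) would drop every row.
kept  <- if (length(flagged) > 0) mtcars[-flagged, ] else mtcars

# Refit the same formula so the two R-squared values are directly comparable.
refit <- lm(formula(fit), data = kept)

c(original = summary(fit)$r.squared, refit = summary(refit)$r.squared)
```

Refitting with an identical formula keeps the comparison clean: any change in R-squared or in the standard errors then comes from the removed points alone.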
plot(songs$NumTags, songsfinal$resid, ylab = "Residuals", xlab = "Number of Tags")
abline(h = 0)
plot(songs$lencent, songsfinal$resid, ylab = "Residuals", xlab = "Length")
abline(h = 0)
plot(songs$logfc, songsfinal$resid, ylab = "Residuals", xlab = "Log of Follower Count")
abline(h = 0)
plot(songs$lognt, songsfinal$resid, ylab = "Residuals", xlab = "Log of Number of Tracks")
abline(h = 0)
plot(songs$lognf, songsfinal$resid, ylab = "Residuals", xlab = "Log of Number of Following")
abline(h = 0)
boxplot(songsfinal$resid~songs$FreeD, ylab = "Residuals", xlab = "Availability to Download")
boxplot(songsfinal$resid~songs$winter, ylab = "Residuals", xlab = "Winter")
boxplot(songsfinal$resid~songs$summer, ylab = "Residuals", xlab = "Summer")
boxplot(songsfinal$resid~songs$spring, ylab = "Residuals", xlab = "Spring")
boxplot(songsfinal$resid~songs$fall, ylab = "Residuals", xlab = "Fall")
boxplot(songsfinal$resid~songs$ProUser, ylab = "Residuals", xlab = "Soundcloud Pro User")
boxplot(songsfinal$resid~songs$RemixY, ylab = "Residuals", xlab = "Remix")
boxplot(songsfinal$resid~songs$country, ylab = "Residuals", xlab = "Country")
confint(songsfinal)
## 2.5 % 97.5 %
## (Intercept) 7.079590210 17.4337052
## sincerelease 0.184003248 0.8267791
## summer -5.824645226 4.9392345
## fall 0.146190452 10.9730760
## winter 0.376523986 11.2364072
## NumTags -0.050800720 0.6844511
## lencent -0.009561036 0.1007560
## RemixY -1.394018853 6.6114644
Looking at the residuals versus all predictors, both continuous and categorical, each plot shows a random scatter with roughly constant variance. The explanatory power of the model is low (the R-squared value is 34.24%), but the residual plots give no sign that the model's assumptions are violated, so I treat it as a reasonable model given the data available.
After this inference process, only a few of the variables appear helpful in predicting the success of a particular Soundcloud track: the time elapsed since release, the season of release, the number of tags used, the length of the song, and whether the song is a remix/radio edit. Of these, only time since release and the fall and winter indicators are significant at the 5% level; the 95% confidence intervals for the number of tags, song length, and the remix indicator all include zero.
For a Soundcloud track that is not a remix or radio edit, was released in the spring, has just been released (zero months since release), has no tags, and is of average length, the predicted engagement is around 12 (12.25665).
Holding all else constant, for every month that has passed since the release of a song, the predicted value of the engagement variable increases by 0.50539. This is reasonable: tracks are most engaged with in the first month after release, and as time passes, people who have already engaged (liked/commented/reposted) simply keep listening. Because the engagement variable is computed as LiCount divided by a weighted sum of likes, comments, and reposts, a larger value means fewer interactions relative to that count. Therefore, as time passes, the value gets farther from 1 and fewer people interact with the track.
Holding all else constant, the engagement value is 2.60872 points higher for remixes and radio edits than for original songs. This makes sense given how Soundcloud is used: there is usually a wide variety of remixes of a particular track, so engagement is spread across them, while there is only one version of the original track for users to engage with.
With one more tag added to a track, the predicted engagement value increases by 0.31683, holding all else constant. Adding hashtags increases the chance of a song being found, especially with general tags such as “#EDM”. Soundcloud's search feature lets users search by hashtag, so someone can search #EDM and play a variety of songs carrying that tag. Such users have a higher chance of simply listening to a track than users who search for something specific and then like, comment, or repost.
Comparing release seasons, the engagement value is 0.44271 lower in the summer than in the spring, 5.55963 higher in the fall than in the spring, and 5.80647 higher in the winter than in the spring. Since a higher value of the engagement variable corresponds to less interaction, users are more engaged with EDM tracks released in the spring and summer months than with those released in the fall or winter.
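The coefficient interpretations above can all be read off a single fitted equation. As a compact restatement (nothing new is estimated here; the numbers are copied from the summary(songsfinal) output, with spring as the baseline season):

```latex
\widehat{\text{engage}} = 12.25665
  + 0.50539\,\text{sincerelease}
  - 0.44271\,\text{summer}
  + 5.55963\,\text{fall}
  + 5.80647\,\text{winter}
  + 0.31683\,\text{NumTags}
  + 0.04560\,\text{lencent}
  + 2.60872\,\text{RemixY}
```

Setting every predictor to zero leaves only the intercept, which is the baseline case of a spring-released original track with zero months since release, no tags, and average length.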
newdata = songs[1,]
newdata$Name = "Song x"
newdata$Artist = "Artist x"
newdata$Likes = 1000
newdata$Com = 10
newdata$Rep = 50
newdata$LiCount = 10000
newdata$sincerelease = 0
newdata$fdY = 0
newdata$FoCount = 1000
newdata$NumTags = 10
newdata$Len = 180
newdata$lencent = 180 - mean(songs$Len)
newdata$NumTracks = 10
newdata$ProUser = 0
newdata$NumFollow = 100
newdata$RemixY = 0
newdata$country = 0
# Same hypothetical track as newdata, but 4 minutes (240 seconds) long
newdata2 = newdata
newdata2$Len = 240
newdata2$lencent = 240 - mean(songs$Len)
newdata
## Name Artist Likes Com Rep LiCount FreeD Mon Year FoCount NumTags
## 1 Song x Artist x 1000 10 50 10000 Yes April 2017 1000 10
## Len NumTracks ProUser NumFollow Remix Loc lencent RemixY pro fdY Month
## 1 180 10 0 100 Yes IL -36.42667 0 1 0 1
## Y sincerelease winter spring summer fall denom engage country logfc
## 1 0 0 0 1 0 0 9770 15.35312 0 10.93311
## lognt lognf
## 1 2.70805 6.180037
newdata2
## Name Artist Likes Com Rep LiCount FreeD Mon Year FoCount NumTags
## 1 Song x Artist x 1000 10 50 10000 Yes April 2017 1000 10
## Len NumTracks ProUser NumFollow Remix Loc lencent RemixY pro fdY Month
## 1 240 10 0 100 Yes IL 23.57333 0 1 0 1
## Y sincerelease winter spring summer fall denom engage country logfc
## 1 0 0 0 1 0 0 9770 15.35312 0 10.93311
## lognt lognf
## 1 2.70805 6.180037
predeng = predict(songsfinal, newdata, interval = "prediction")
predeng
## fit lwr upr
## 1 13.76393 -3.647572 31.17544
predeng2 = predict(songsfinal, newdata2, interval = "prediction")
predeng2
## fit lwr upr
## 1 16.49978 -1.09276 34.09233
Using the predict function in R, I created two hypothetical tracks that are identical except for their length in order to see the impact of length on engagement. One has a length of 3 minutes and the other 4 minutes (180 and 240 seconds). Holding all else constant, increasing the song length by 1 minute (60 seconds) increases predicted engagement by about 2.7 points (16.49978 - 13.76393 = 2.73585), although the two prediction intervals overlap almost entirely.
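As a rough cross-check, the two fitted values can be reproduced by hand from the printed coefficients. The sketch below hard-codes the rounded estimates from the summary(songsfinal) output, so results may differ from predict() in the last few digits; predict_by_hand is a hypothetical helper written for this illustration, not part of the original analysis.

```r
# Coefficients copied (rounded) from the summary(songsfinal) output
b <- c(intercept = 12.25665, sincerelease = 0.50539, summer = -0.44271,
       fall = 5.55963, winter = 5.80647, NumTags = 0.31683,
       lencent = 0.04560, RemixY = 2.60872)

# Hypothetical helper: linear predictor for one track
predict_by_hand <- function(sincerelease, summer, fall, winter,
                            NumTags, lencent, RemixY) {
  x <- c(1, sincerelease, summer, fall, winter, NumTags, lencent, RemixY)
  sum(b * x)
}

# Spring release, zero months since release, 10 tags, not a remix;
# lencent values are the ones printed for newdata and newdata2
fit_3min <- predict_by_hand(0, 0, 0, 0, 10, -36.42667, 0)
fit_4min <- predict_by_hand(0, 0, 0, 0, 10,  23.57333, 0)

# The gap between the two fits is just 60 seconds times the lencent slope
gap        <- fit_4min - fit_3min
slope_x_60 <- 60 * b[["lencent"]]

print(c(fit_3min = fit_3min, fit_4min = fit_4min,
        gap = gap, slope_x_60 = slope_x_60))
```

Because the model is linear in lencent, the one-minute effect does not depend on the other settings chosen for the hypothetical track.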
While studying this topic, I realized how difficult it is to build an accurate model of a song's engagement because likes, comments, and reposts follow a non-linear trend over time. From month to month these counts change unpredictably (songs can come back into the spotlight years later as tastes in popular culture shift), and it's hard to account for this. I attempted to do so with the sincerelease variable. I also used only 75 songs, chosen from my own likes, because no public Soundcloud datasets were available online. This is a large limitation; further studies should use a much larger dataset to see which variables are truly good predictors of engagement. My scope of inference is also limited to EDM music on the website; many other types of music are available and could be researched in further studies.
In the end, it appears that the season of release, the time elapsed since release, the number of tags used, the length of the song, and whether the song is a remix are useful predictors of user engagement with a particular track.
###DJ Friend checking potential
nd = songs[1,]
nd$Name = "Oceans"
nd$Artist = "Dexter"
nd$sincerelease = 1
nd$Likes = 117
nd$Com = 4
nd$Rep = 18
nd$LiCount = 2819
nd$engage = 2819/(117+4*3+18*2)
nd$fdY = 0
nd$FoCount = 3786
nd$NumTags = 0
nd$Len = 220
nd$NumTracks = 13
nd$ProUser = 0
nd$NumFollow = 76
nd$RemixY = 1
nd$country = 0
nd
## Name Artist Likes Com Rep LiCount FreeD Mon Year FoCount NumTags Len
## 1 Oceans Dexter 117 4 18 2819 Yes April 2017 3786 0 220
## NumTracks ProUser NumFollow Remix Loc lencent RemixY pro fdY Month Y
## 1 13 0 76 Yes IL 15.57333 1 1 0 1 0
## sincerelease winter spring summer fall denom engage country logfc
## 1 1 0 1 0 0 9770 17.08485 0 10.93311
## lognt lognf
## 1 2.70805 6.180037
predeng = predict(songsfinal, nd, interval = "prediction")
predeng
## fit lwr upr
## 1 16.08087 -1.537933 33.69967
difference = nd$engage - 16.08087
difference
## [1] 1.003978
I decided to use my regression model to see whether I could accurately predict engagement with my friend Dexter's track. He posted the song exactly a month ago, and despite my model's low explanatory power, the prediction was not far off (a difference of 1.003978). This is only a single track, but it suggests that my model may be more accurate for tracks within one or two months of their release date.
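One caveat about this comparison: nd is copied from the first row of songs, and the code shown never reassigns nd$lencent, so the printed lencent of 15.57333 appears to be carried over from that row rather than computed from the 220-second length. The sketch below is a rough check of how much this matters. It assumes a mean song length of 216.42667 seconds (inferred from the lencent of -36.42667 printed for the 180-second track) and uses the rounded lencent coefficient, so the adjusted figures are approximate.

```r
# Observed engagement, using the same formula as nd$engage
observed <- 2819 / (117 + 4 * 3 + 18 * 2)

# Mean length implied by the printed output: 180 - mean(Len) = -36.42667 (assumption)
mean_len <- 180 + 36.42667

lencent_used <- 15.57333          # value printed for nd (carried over from row 1)
lencent_220  <- 220 - mean_len    # centered length for a 220-second track

b_lencent    <- 0.04560           # rounded coefficient from summary(songsfinal)
fit_printed  <- 16.08087          # fit reported by predict()

# Shift the printed fit by the change in lencent times its slope
fit_adjusted <- fit_printed - b_lencent * (lencent_used - lencent_220)

print(c(observed = observed, lencent_220 = lencent_220,
        fit_adjusted = fit_adjusted, gap = observed - fit_adjusted))
```

Under these assumptions the adjusted prediction sits somewhat further from the observed value than the reported difference, but still well inside the prediction interval.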